Pandas Practice Questions¶

This notebook contains 20 comprehensive Python pandas practice problems organized in two sections:

Section A - Short Coding Questions (Questions 1-17):

  • Questions 1-12: Basic pandas operations (loading, selection, filtering, handling missing values, grouping, merging, datetime conversion)
  • Questions 13-17: Short coding questions on duplicates, missing values, column creation, filtering, and statistics

Section B - Applied Coding Questions (Questions 18-20):

  • Question 18: GroupBy with multiple aggregations
  • Question 19: Advanced filtering and column creation
  • Question 20: Handling missing values and outliers

Each question includes:

  • Clear problem description
  • Hints for solving
  • Multiple-choice code options (where applicable)
  • Instructor solution with inline examples
  • Test cases using small DataFrames
In [3]:
import pandas as pd
import numpy as np
from io import StringIO
In [15]:
name = 'Anay Mittal'
roll_number = '2423357'

1. Load a CSV string into a DataFrame¶

Return: A pandas DataFrame from the CSV string

Choose the correct line:

  • (a) return pd.read_excel(StringIO(csv_string))
  • (b) return pd.read_csv(StringIO(csv_string))
  • (c) return pd.DataFrame(csv_string.split('\n'))
  • (d) return csv_string.to_dataframe()
In [1]:
def load_csv_string(csv_string: str) -> pd.DataFrame:
    return pd.read_csv(StringIO(csv_string))
# csv_data = 'name,age,score\nAlice,25,85\nBob,30,90\nCharlie,22,78'

2. Get shape and column names¶

Return: A tuple of (number of rows, number of columns, list of column names)

Choose the correct code:

  • (a) return (df.size, df.ndim, df.columns)
  • (b) return (df.shape[0], df.shape[1], list(df.columns))
  • (c) return df.info()
  • (d) return (len(df), len(df.index), df.to_list())
In [9]:
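def get_dataframe_info(df: pd.DataFrame) -> tuple:
    # df.info() only prints a summary and returns None, so build the tuple explicitly.
    return (df.shape[0], df.shape[1], list(df.columns))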
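
To see why only option (b) yields the requested tuple, the sketch below compares the attributes named in the options. The sample data is an assumption that reuses the CSV columns from Question 1.

```python
import pandas as pd

# Hypothetical sample data, reusing the columns from the Question 1 CSV string.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 22],
    "score": [85, 90, 78],
})

n_rows, n_cols = df.shape      # shape is (rows, columns)
col_names = list(df.columns)   # plain Python list instead of an Index object

print(n_rows, n_cols, col_names)
print(df.size)   # total number of cells, rows * columns
print(df.ndim)   # number of axes, always 2 for a DataFrame
```

Here `df.size` and `df.ndim` are valid attributes, but they describe the cell count and the number of axes rather than the row and column counts.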

3. Get the first n rows of a DataFrame¶

Return: DataFrame containing first n rows

Choose the correct code:

  • (a) return df.iloc[:n]
  • (b) return df.head(n)
  • (c) return df.nlargest(n, axis=0)
  • (d) return df[:n:1]
In [8]:
def get_first_n_rows(df: pd.DataFrame, n: int) -> pd.DataFrame:
    return df.head(n)

4. Get basic statistics for numeric columns¶

Return: A pandas DataFrame with descriptive statistics (using .describe())

In [10]:
def describe_numeric(df: pd.DataFrame) -> pd.DataFrame:
    return df.describe()

5. Select a single column as a Series¶

Return: A pandas Series for the specified column

In [16]:
def select_column(df: pd.DataFrame, col_name: str) -> pd.Series:
    return df[col_name]

6. Filter rows where a column value exceeds a threshold¶

Return: A DataFrame containing only rows where column > threshold

Hint: Use boolean indexing df[df[col_name] > threshold] and .reset_index(drop=True) to reset row indices.

Choose the correct code:

  • (a) return df.filter(column=col_name, value=threshold)
  • (b) return df.loc[df[col_name] > threshold]
  • (c) return df[df[col_name] > threshold].reset_index(drop=True)
  • (d) return df.query(f'{col_name} > {threshold}')
In [12]:
def filter_by_threshold(df: pd.DataFrame, col_name: str, threshold: float) -> pd.DataFrame:
    return df[df[col_name] > threshold].reset_index(drop=True)
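
Options (b) and (c) select the same rows; the difference is the index of the result. A minimal sketch on made-up data (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical scores; rows 1 and 3 exceed the threshold of 80.
df = pd.DataFrame({"name": ["A", "B", "C", "D"], "score": [70, 92, 65, 88]})

kept_index = df.loc[df["score"] > 80]                      # option (b)
fresh_index = df[df["score"] > 80].reset_index(drop=True)  # option (c)

print(list(kept_index.index))    # original row labels survive the filter
print(list(fresh_index.index))   # labels renumbered from zero
```

If the tests compare against a DataFrame built from scratch, the renumbered index in option (c) is what makes the two frames equal.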

7. Count missing (NaN) values in each column¶

Return: A pandas Series with column names as index and count of NaN as values

Hint: Use .isnull().sum() to count missing values in each column.

In [14]:
def count_missing_values(df: pd.DataFrame) -> pd.Series:
    # isnull() marks each NaN cell; sum() then counts the True values per column.
    return df.isnull().sum()

8. Drop rows containing any NaN values¶

Return: A DataFrame with all rows containing NaN removed

Hint: Use .dropna() to remove rows with missing values, then .reset_index(drop=True) to renumber rows.

In [ ]:
def drop_rows_with_nan(df: pd.DataFrame) -> pd.DataFrame:
    # dropna() returns a new DataFrame, so chain the calls and return the result.
    return df.dropna().reset_index(drop=True)
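
The hints for Questions 7 and 8 can be tried on a small frame. The sketch below uses made-up data and also shows that calling `.dropna()` without keeping its return value leaves the original DataFrame unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in two columns.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25.0, np.nan, 22.0],
    "score": [85.0, 90.0, np.nan],
})

missing_per_column = df.isnull().sum()   # Series indexed by column name

df.dropna()                              # result discarded; df itself is unchanged
rows_after_bare_call = len(df)

cleaned = df.dropna().reset_index(drop=True)   # keep the returned DataFrame

print(missing_per_column.to_dict())
print(rows_after_bare_call, len(cleaned))
```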

9. Fill missing values with the mean of the column¶

Return: A DataFrame where NaN values in numeric columns are replaced by column mean

Hint: Get numeric columns using .select_dtypes(), then use .fillna() with the column mean.

In [ ]:
def fill_missing_with_mean(df: pd.DataFrame) -> pd.DataFrame:
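    df_copy = df.copy()
    numeric_cols = df_copy.select_dtypes(include='number').columns
    df_copy[numeric_cols] = df_copy[numeric_cols].fillna(df_copy[numeric_cols].mean())
    return df_copy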
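
One way to follow the hint is sketched below on made-up data. It assumes that non-numeric columns should be left untouched and that the input frame should not be modified.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one text column and two numeric columns with gaps.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [20.0, np.nan, 30.0],
    "score": [80.0, 90.0, np.nan],
})

filled = df.copy()  # work on a copy so the caller's frame is not changed
numeric_cols = filled.select_dtypes(include="number").columns

# mean() returns one value per numeric column; fillna aligns them by column name.
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

print(filled)
```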

10. Group by a column and calculate the mean of another column¶

Return: A DataFrame with grouped results (group column and mean)

Hint: Use .groupby(group_col)[agg_col].mean() and .reset_index() to convert to DataFrame.

In [ ]:
def group_by_mean(df: pd.DataFrame, group_col: str, agg_col: str) -> pd.DataFrame:
    return df.groupby(group_col)[agg_col].mean().reset_index()
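
The role of `.reset_index()` in the hint is easier to see on a small example. The column names and values below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: two rows for team A, one for team B.
df = pd.DataFrame({"team": ["A", "A", "B"], "score": [10, 20, 40]})

grouped = df.groupby("team")["score"].mean()   # Series with team as its index
result = grouped.reset_index()                 # DataFrame with team and score columns

print(type(grouped).__name__, type(result).__name__)
print(result)
```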

11. Merge two DataFrames on a common column¶

Return: A merged DataFrame (inner join on the specified key)

Choose the correct code:

  • (a) return left.join(right, on=on)
  • (b) return pd.concat([left, right])
  • (c) return pd.merge(left, right, on=on, how='inner')
  • (d) return left.combine(right)
In [11]:
def merge_dataframes(left: pd.DataFrame, right: pd.DataFrame, on: str) -> pd.DataFrame:
    return pd.merge(left, right, on=on, how='inner')
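
An inner join keeps only the keys that appear in both frames. A short sketch with made-up frames (the `id`, `name`, and `score` columns are assumptions), contrasted with `pd.concat`, which only stacks rows:

```python
import pandas as pd

# Hypothetical frames sharing an "id" key; only ids 2 and 3 appear in both.
left = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [90, 78, 66]})

merged = pd.merge(left, right, on="id", how="inner")
stacked = pd.concat([left, right])   # option (b): rows are stacked, not matched

print(merged)
print(len(stacked))
```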

12. Convert a column to datetime format¶

Return: A DataFrame where the specified column has been converted to datetime

Choose the correct code:

  • (a) df_copy[col_name] = df_copy[col_name].astype(datetime)
  • (b) df_copy[col_name] = pd.to_datetime(df_copy[col_name])
  • (c) df_copy[col_name].convert_to_datetime()
  • (d) df_copy[col_name] = datetime.strptime(df_copy[col_name], '%Y-%m-%d')
In [17]:
def convert_to_datetime(df: pd.DataFrame, col_name: str) -> pd.DataFrame:
    df_copy = df.copy()  # avoid modifying the caller's DataFrame
    df_copy[col_name] = pd.to_datetime(df_copy[col_name])
    return df_copy
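
A quick check that the conversion took effect, sketched on made-up date strings (the `date` column name is an assumption):

```python
import pandas as pd

# Hypothetical column of ISO-formatted date strings.
df = pd.DataFrame({"date": ["2024-01-15", "2024-02-20"]})

df_copy = df.copy()
df_copy["date"] = pd.to_datetime(df_copy["date"])

# The converted column supports the .dt accessor; the original strings do not.
print(pd.api.types.is_datetime64_any_dtype(df_copy["date"]))
print(df_copy["date"].dt.month.tolist())
```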

13. Drop Duplicate Rows¶

You have a DataFrame with duplicate rows. Use drop_duplicates with the column subset ['Name', 'Team'] to remove rows that repeat the same Name and Team combination.

In [18]:
def drop_duplicates_by_cols(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop_duplicates(subset=['Name', 'Team'])
sample_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'Charlie'], 'Team': ['X', 'Y', 'X', 'Z'], 'Salary': [50000, 55000, 50000, 60000]})
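
Running the call on the sample DataFrame from this cell shows which row is dropped. By default `drop_duplicates` keeps the first occurrence and preserves the original index labels.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Charlie"],
    "Team": ["X", "Y", "X", "Z"],
    "Salary": [50000, 55000, 50000, 60000],
})

# Rows 0 and 2 share the same Name and Team; the later one is removed.
deduped = sample_df.drop_duplicates(subset=["Name", "Team"])

print(deduped)
print(list(deduped.index))
```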

14. Fill Missing Values in a Column¶

Write a Python command to fill all missing values in the column 'College' with the text 'Unknown'.

In [20]:
def fill_missing_college(df: pd.DataFrame) -> pd.DataFrame:
    df['College'] = df['College'].fillna('Unknown')
    return df
# sample_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'College': ['IIT', None, 'NIT']})
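
A minimal sketch of the fill on the sample data from this question. Only the missing entry changes; existing values are kept.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "College": ["IIT", None, "NIT"],
})

# fillna replaces only the missing entries in the column.
sample_df["College"] = sample_df["College"].fillna("Unknown")

print(sample_df)
```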

15. Create New Column with Percentage Increase¶

Given a DataFrame df with a 'Salary' column, write code to increase salary by 5% and store it in a new column 'UpdatedSalary'.

Hint: Multiply the Salary column by 1.05 to increase by 5%.

Choose the correct code:

  • (a) df['UpdatedSalary'] = df['Salary'] * 5
  • (b) df['UpdatedSalary'] = df['Salary'] * 1.05
  • (c) df['UpdatedSalary'] = df['Salary'] + 0.05
  • (d) df['UpdatedSalary'] = df['Salary'].apply(lambda x: x * 5)
In [19]:
def add_updated_salary(df: pd.DataFrame) -> pd.DataFrame:
    df['UpdatedSalary'] = df['Salary'] * 1.05
    return df
# sample_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Salary': [50000, 55000, 60000]})

16. Filter Rows with Range Condition¶

Write Python code to select rows where 'Profit' is between 30 and 55 (inclusive).

Hint: Use boolean indexing with AND operator &.

In [ ]:
def filter_profit_range(df: pd.DataFrame) -> pd.DataFrame:
    return df[(df['Profit'] >= 30) & (df['Profit'] <= 55)]
# sample_df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'Profit': [25, 40, 60, 35]})
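
The hint's `&` form can be checked against `Series.between`, which is inclusive on both ends by default. This sketch uses the sample data from the question; `between` is offered as an alternative, not as the expected answer.

```python
import pandas as pd

sample_df = pd.DataFrame({"Name": ["A", "B", "C", "D"], "Profit": [25, 40, 60, 35]})

# Each comparison needs its own parentheses because & binds tighter than >= and <=.
mask_ops = (sample_df["Profit"] >= 30) & (sample_df["Profit"] <= 55)
mask_between = sample_df["Profit"].between(30, 55)

in_range = sample_df[mask_ops]

print(in_range)
print(mask_ops.equals(mask_between))
```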

17. Get Summary Statistics¶

Write a Python command to show summary statistics (mean, median, std, min, max, etc.) for the entire DataFrame.

In [21]:
def get_summary_stats(df: pd.DataFrame) -> pd.DataFrame:
    return df.describe()
# sample_df = pd.DataFrame({'Age': [25, 30, 28, 35], 'Salary': [50000, 55000, 52000, 60000]})

18. GroupBy with Multiple Aggregations¶

You have a DataFrame with columns: Name, Team, Salary, Profit

Write Python code to:

  1. Group the data by Team
  2. Aggregate average salary and total profit for each team
  3. Return the result

Hint: Use .groupby() with .agg() for multiple aggregations.

Choose the correct code:

  • (a) df.groupby('Team').agg({'Salary': 'mean', 'Profit': 'sum'})
  • (b) df.groupby('Team')[['Salary', 'Profit']].agg(['mean', 'sum'])
  • (c) df.group('Team').apply(lambda x: {'avg_salary': x['Salary'].mean(), 'total_profit': x['Profit'].sum()})
In [22]:
def groupby_team_agg(df: pd.DataFrame) -> pd.DataFrame:
    return df.groupby('Team').agg({'Salary': 'mean', 'Profit': 'sum'})
# sample_df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'Team': ['X', 'X', 'Y', 'Y'], 'Salary': [50000, 55000, 52000, 53000], 'Profit': [45, 30, 60, 25]})
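
Options (a) and (b) both run, but they return differently shaped results. The sketch below applies each to the sample data from this question: the dictionary form computes one statistic per column, while the list form computes every statistic for every column.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Team": ["X", "X", "Y", "Y"],
    "Salary": [50000, 55000, 52000, 53000],
    "Profit": [45, 30, 60, 25],
})

# Option (a): mean of Salary and sum of Profit, one output column each.
option_a = sample_df.groupby("Team").agg({"Salary": "mean", "Profit": "sum"})

# Option (b): mean and sum of both columns, giving four two-level columns.
option_b = sample_df.groupby("Team")[["Salary", "Profit"]].agg(["mean", "sum"])

print(option_a)
print(list(option_b.columns))
```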

19. Advanced Filtering and Column Creation¶

Given a DataFrame with columns: Name, Score1, Score2

Write Python code to:

  1. Select only rows where Score1 > 40 AND Score2 > 50
  2. Create a new column AverageScore = mean of Score1 and Score2
  3. Return a DataFrame with only the columns ['Name', 'AverageScore']

Hint: Filter first using boolean indexing, then add the new column, then select specific columns.

In [23]:
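def advanced_filter_and_create(df: pd.DataFrame) -> pd.DataFrame:
    # Keep rows passing both conditions, then add the row-wise mean and select columns.
    filtered = df[(df['Score1'] > 40) & (df['Score2'] > 50)]
    return filtered.assign(AverageScore=filtered[['Score1', 'Score2']].mean(axis=1))[['Name', 'AverageScore']]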
# sample_df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'Score1': [40, 55, 70, 30], 'Score2': [50, 65, 75, 35]})
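
The three steps can be traced on the sample data from this question. Row A sits exactly on both thresholds (Score1 = 40, Score2 = 50), so the strict `>` comparisons exclude it. This is a sketch of one possible ordering of the steps, following the hint.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Score1": [40, 55, 70, 30],
    "Score2": [50, 65, 75, 35],
})

# Step 1: filter with both conditions (strictly greater than).
mask = (sample_df["Score1"] > 40) & (sample_df["Score2"] > 50)
filtered = sample_df[mask].copy()   # copy so the new column is added to an independent frame

# Step 2: row-wise mean of the two score columns.
filtered["AverageScore"] = filtered[["Score1", "Score2"]].mean(axis=1)

# Step 3: keep only the requested columns.
result = filtered[["Name", "AverageScore"]]

print(result)
```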

20. Handle Missing Values and Outliers¶

You have a DataFrame with an 'Age' column containing missing values and outliers (Age > 100).

Write Python code to:

  1. Replace missing values with the median age
  2. Remove rows where Age > 100
  3. Return the cleaned DataFrame

Hint: Use .fillna() with median, then filter with boolean indexing.

In [ ]:
def clean_age_data(df: pd.DataFrame) -> pd.DataFrame:
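    df_copy = df.copy()
    df_copy['Age'] = df_copy['Age'].fillna(df_copy['Age'].median())
    return df_copy[df_copy['Age'] <= 100]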
# sample_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'], 'Age': [25, np.nan, 105, 30, np.nan]})
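
Following the steps in the order given means the median is computed while the outlier is still present. The sketch below runs that order on the sample data and, for comparison only, shows the median that would result if the outlier were excluded first.

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "Age": [25, np.nan, 105, 30, np.nan],
})

# Step 1: median() skips NaN but still includes the outlier 105.
median_age = sample_df["Age"].median()
cleaned = sample_df.copy()
cleaned["Age"] = cleaned["Age"].fillna(median_age)

# Step 2: drop rows where Age exceeds 100.
cleaned = cleaned[cleaned["Age"] <= 100]

# For comparison only: the median of the ages that are not outliers.
median_without_outlier = sample_df.loc[~(sample_df["Age"] > 100), "Age"].median()

print(cleaned)
print(median_age, median_without_outlier)
```

Which order is appropriate depends on the dataset; the question as written asks for filling first and filtering second.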